NAR Genomics and Bioinformatics
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match the content profile of NAR Genomics and Bioinformatics, based on 214 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.
Wang, Z.; Arsuaga, J.
Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants--from receptor-binding proteins to anti-defense systems and downstream infection compatibility--and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 captured a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics. Top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant. 
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction. Author summary: Bacteriophages are viruses that prey on bacteria and play central roles in microbial ecosystems, nutrient cycling, and the spread of antibiotic resistance genes. Knowing which bacterium a phage can infect is important for applications such as phage therapy, where viruses are used to treat bacterial infections, but making this prediction from DNA sequence data alone remains difficult. Existing computational tools each exploit different types of genomic evidence, and none works reliably across all settings. We asked whether an artificial intelligence model trained to read raw DNA--without ever being shown which phages infect which hosts--could contribute a new, complementary signal. We found that this approach was particularly effective at narrowing the field to a short list of candidate hosts and at capturing broad evolutionary relationships between phages and bacteria. When we combined it with established sequence-comparison tools, overall prediction improved beyond what any single method achieved alone. By examining when each method succeeded or failed, we identified biological factors that govern prediction difficulty, offering practical guidance for building more robust prediction systems.
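The retrieval scheme described above (ranking candidate hosts by cosine similarity between normalized whole-genome embeddings, then reciprocal rank fusion across methods) can be sketched as follows. This is an illustrative reduction, not the authors' code; the fusion constant k=60 is a common default for reciprocal rank fusion, not a value taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_hosts(phage_emb, host_embs):
    """Rank candidate host indices by cosine similarity to the phage embedding, best first."""
    sims = [cosine(phage_emb, h) for h in host_embs]
    return sorted(range(len(host_embs)), key=sims.__getitem__, reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists from several methods: each list contributes 1/(k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A host ranked near the top by several methods accumulates a larger fused score than one ranked first by a single method, which is why fusion tends to lift top-10 retrieval.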
Wolfram-Schauerte, M.; Trust, C.; Waffenschmidt, N.; Nieselt, K.
Time-resolved transcriptomic profiling has been used to study phage-host interactions for more than a decade. However, the resulting datasets are not readily accessible for custom re-analysis, and resources are lacking that provide standardized processing, storage, and analysis of transcriptomes from phage infections. Here, we present the PhageExpressionAtlas, the first bioinformatics resource for storing time-resolved dual RNA-sequencing data from phage infections. This data was processed uniformly using a custom analysis pipeline and is presented for interactive exploration through visualisation. The PhageExpressionAtlas currently hosts 42 datasets from 23 studies. Using the PhageExpressionAtlas, we replicate key findings from original publications and extend hypothesis testing across multiple phage-host systems. By systematically querying and analyzing the underlying database, we evaluate approaches to phage gene classification and show that uncharacterized phage genes are expressed across all infection phases. Moreover, we provide a comprehensive view of the expression dynamics of anti-phage defenses as well as host- and phage-encoded anti-defense systems in the infection context, indicating unique and conserved patterns of transcriptional regulation underlying bacterial anti-phage immunity and phage counter-strategies. Together, the PhageExpressionAtlas is a unifying resource that democratizes transcriptomics-driven analyses of phage-host interactions and supports integrative cross-study assessment.
Forcier, T.; Cheng, E.; Tam, O. H.; Wunderlich, C.; Castilla-Vallmanya, L.; Jones, J. L.; Quaegebeur, A.; Barker, R. A.; Jakobsson, J.; Gale Hammell, M.
Transposable elements (TEs) are mobile genetic sequences that can generate new copies of themselves via insertional mutations. These viral-like sequences comprise nearly half the human genome and are present in most genome-wide sequencing assays. While only a small fraction of genomic TEs have retained their ability to transpose, TE sequences are often transcribed from their own promoters or as part of larger gene transcripts. Accurately assessing TE expression from each individual genomic TE locus remains an open problem in the field, due to the highly repetitive nature of these multi-copy sequences. These issues are compounded in single-cell and single-nucleus transcriptome experiments, where additional complications arise due to sparse read coverage and unprocessed mRNA introns. Here we present our tool for single-cell TE and gene expression analysis, TEsingle. Using synthetic datasets, we show the problems that arise when not properly accounting for intron retention events, failing to address uncertainty in alignment scoring, and failing to make use of unique molecular identifiers for transcript resolution. Addressing these challenges has enabled an accurate TE analysis suite that simultaneously tracks gene expression as well as locus-specific resolution of expressed TEs. We showcase the performance of TEsingle using single-nucleus profiles from substantia nigra (SN) tissues of Parkinson's Disease (PD) patients. We find examples of young and intact TEs that mark dopaminergic (DA) neurons as well as many young TEs from the LINE and ERV families that are elevated in PD neurons and glia. These results demonstrate that TE expression is highly cell-type and cellular-state specific and elevated in particular subsets of neurons, astrocytes, and microglia from PD patients.
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related, but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single cell analysis of an individual cell type and describing microglia function as concurrent molecular programs offers a more parsimonious model of microglia function.
Zhang, X.
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
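The two repeatability metrics quoted above differ in how they weight categories: mean category Jaccard averages per-category overlap, while micro-Jaccard pools all accessions before comparing. A minimal sketch, with hypothetical category names and accessions:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def repeatability(run1, run2):
    """run1, run2: dicts mapping category -> set of accessions from one run each.
    Returns (mean per-category Jaccard, micro-Jaccard over the pooled sets)."""
    cats = run1.keys() | run2.keys()
    mean_j = sum(jaccard(run1.get(c, set()), run2.get(c, set())) for c in cats) / len(cats)
    pool1 = set().union(*run1.values())
    pool2 = set().union(*run2.values())
    return mean_j, jaccard(pool1, pool2)
```

Because micro-Jaccard weights large categories more heavily, a system can score well on the mean metric while pooled overlap lags, as in the DeerFlow figures (0.795 vs. 0.571).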
May, G. E.; Akirtava, C.; McManus, J.
Since the discovery of viral Internal Ribosome Entry Sites (IRESes), researchers have sought to find similar elements in mammalian host genes, termed "cellular IRESes". However, the plasmid systems used to measure cellular IRES activity are vulnerable to false positives due to promoter activity in candidate IRESes. Orthogonal methods are needed to validate putative IRESes while carefully avoiding artifacts known to cause false positives. Recently, Koch et al. proposed approaches for studying IRESes, primarily circular RNA-generating plasmids, and for validating mRNA transcripts using smFISH and qRT-PCR. Here, we demonstrate confounding variables and artifacts in each of these approaches that can lead to inappropriate conclusions about potential cellular IRES activity. We show the back-splicing circRNA plasmid creates linear mRNA artifacts associated with false-positive IRES signals. Using orthogonal, gold-standard assays validated with viral IRESes, we find putative cellular IRESes reported using the back-splicing plasmid have no IRES activity. Furthermore, we demonstrate that smFISH and qRT-PCR can misidentify nuclear non-coding RNAs as mRNAs and we validate a single-molecule sequencing assay for identifying genuine mRNA 5′ ends. Our work establishes reliable methods for robust transcript annotation and IRES studies that avoid documented artifacts arising from bicistronic and back-splicing circRNA plasmid reporters.
Pan, L.; Chen, M.; Tanik, M.
The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference--with emphasis on information-theoretic and sequence-based approaches--and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information--nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic--establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
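Per-position Shannon entropy, the first layer of the proposed framework, can be computed directly from an alignment column by column. A minimal stdlib sketch; the toy alignment is hypothetical, and gap handling is omitted:

```python
import math
from collections import Counter

def positional_entropy(seqs):
    """Per-position Shannon entropy in bits for a set of equal-length aligned
    sequences: H_i = -sum_b p_i(b) * log2 p_i(b) over observed characters b."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        n = sum(counts.values())
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(h)
    return profile
```

A perfectly conserved position scores 0 bits; a position with all four nucleotides equally represented scores 2 bits, the maximum for DNA.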
Albuja, D. S.; Maldonado, P. S.; Zambrano, P. E.; Olmos, J. R.; Vera, E. R.
Accurate fungal species identification is critical for microbial ecology, food safety, and plant pathology. However, morphological limitations and genomic complexity hinder this process. Molecular markers such as the ITS region, along with Oxford Nanopore long-read sequencing, offer a robust solution, albeit limited by error rates in homopolymeric regions and a high dependence on advanced computational resources (GPUs) to achieve high accuracy. This study benchmarks two bioinformatics workflows on a multiplexed dataset of complex fungal communities to address this technological gap: a CPU-based workflow optimized using a Bayesian machine learning engine and a GPU-accelerated workflow incorporating "super high accuracy" (SUP) models and refinement with neural networks. The results establish a scalable framework for evaluating the impact of computational architecture on final taxonomic resolution. It is demonstrated that GPU processing maximizes data retention and species-level accuracy by correcting systematic errors. Alternatively, implementing automated hyperparameter optimization in CPU environments stabilizes sequence clustering and achieves high taxonomic concordance at the genus level. This conceptual advance validates the feasibility of performing ITS metabarcoding analysis in resource-constrained infrastructures, thus providing the scientific community with a reproducible protocol that balances the need for taxonomic precision with hardware availability.
Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.
Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
Graphical Abstract: STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML output that wraps this information into a report.
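The regular-expression step described above can be illustrated with a toy per-read profile. This is a deliberate simplification of what an interruption-aware genotyper does, not STRmie-HD's implementation; in particular, the CAACAG check is only a crude proxy for the real interruption-variant logic:

```python
import re

def cag_ccg_profile(read):
    """Toy per-read repeat profile for an HTT exon 1 read: longest uninterrupted
    CAG tract (in repeat units), total CCG units, and whether the canonical
    CAACAG interruption immediately follows the CAG tract."""
    cag_runs = re.findall(r"(?:CAG)+", read)
    pure_cag = max((len(r) // 3 for r in cag_runs), default=0)
    ccg_units = sum(len(r) // 3 for r in re.findall(r"(?:CCG)+", read))
    has_caacag = re.search(r"(?:CAG)+CAACAG", read) is not None
    return pure_cag, ccg_units, has_caacag
```

Applying such a profile independently to every read is what makes it possible to see a distribution of tract lengths per sample, which is the raw material for somatic mosaicism metrics.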
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an already existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. Data from the original article were used to benchmark the new pipeline. Its outputs closely match those of the original study with minor variations. Conclusions: MOAflow demonstrates how the adoption of a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands of large-scale genomics.
Muneeb, M.; Ascher, D.
Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings (χ² = 102.29, p = 2.57 × 10⁻¹¹), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
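The Friedman test used above compares tools by reducing each phenotype-fold combination to within-block ranks. A minimal sketch of the statistic, without tie handling or the p-value computation (scipy.stats.friedmanchisquare is the usual production choice); the score matrix is hypothetical:

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic for a blocks x treatments score matrix
    (here: phenotype-folds x PRS tools, higher score = better). Ranks are
    assigned within each block; ties are broken arbitrarily, not averaged."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

When every block ranks the tools identically, the statistic reaches its maximum n(k-1); a large value, as reported here, indicates tool rankings far from random.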
Nguyen-Hoang, A.; Arslan, K.; Kopalli, V.; Windpassinger, S.; Perovic, D.; Stahl, A.; Golicz, A.
Hi-C data is commonly used for reference-free de novo scaffolding. However, with the rapid increase in high-quality reference genomes, reference-guided workflows are now more practical for assembling large numbers of target genomes without relying on costly and labor-intensive Hi-C sequencing. Recently, a pangenome graph-based haplotype sampling algorithm was introduced to generate personalized graphs for target genomes. Such graphs have strong potential as references for reference-guided contig scaffolding. Here, we present noHiC, a reference-guided scaffolding pipeline supporting key steps of plant contig scaffolding. A distinctive feature of noHiC is the nohic-refpick script, generating a best-fit synthetic reference (synref) from a pangenome graph that is genetically close to the target contigs. This enables the integration of genetic information from many references (up to 48 in our tests) without using them separately during scaffolding. Synrefs showed advantages over highly contiguous conventional references in reducing false contig breaking during reference-based correction. Additionally, nohic-refpick can be combined with fast scaffolders (ntJoin) to rapidly produce highly contiguous assemblies using synrefs derived from pangenome graphs. The noHiC pipeline, used alone or in combination with ntJoin, can generally produce assemblies that are structurally consistent with public Hi-C-based or manually curated genomes. The pipeline is publicly available at https://github.com/andyngh/noHiC.
Rikk, L.; Ghaffarinia, A.; Leigh, N. D.
Accurate genome annotation remains challenging as assembly quality often exceeds annotation reliability. Resolving ambiguities of gene presence, absence, and orthology typically requires integrating two complementary lines of evidence: sequence homology between species and the conservation of gene order (i.e., synteny). BLAST remains the standard for homology detection, yet its raw output can be difficult to interpret. Existing tools address this challenge but operate at opposing scales. Alignment viewers provide detailed pairwise statistics without genomic context, while synteny tools offer chromosome-scale perspectives without sequence-level resolution. To fill this intermediate gap, we developed Novabrowse, an interactive BLAST results interpretation framework featuring high-resolution multi-species synteny analysis, chromosomal re-arrangement investigation, ortholog detection, and gene signal discovery. Users define a genomic region of interest in a query species and/or use custom sequences, then select one or more subject species for comparison. The pipeline retrieves query gene sequences via NCBI API integration and performs BLAST searches against each subject transcriptome or genome. Results are presented via an interactive HTML file featuring alignment statistics, chromosomal maps, coverage visualizations, ribbon plots, and distance-based clustering of high-scoring segment pairs into putative gene units. We demonstrate these capabilities by investigating Foxp3, Aire, and Rbl1, three highly conserved vertebrate genes, in the recently assembled genome of the newt Pleurodeles waltl. Foxp3 and Aire have not been described in any salamander species to date, despite availability of multiple assemblies and extensive transcriptomic datasets. Using Novabrowse, we discovered conserved loci and gene signals for both genes in P. waltl, the presence of which was subsequently confirmed via Nanopore long-read RNA sequencing. 
In contrast, Rbl1 analysis uncovered a chromosomal rearrangement at its expected locus with no gene signal detected, indicating a gene loss specific to P. waltl despite the gene's retention in the closely related axolotl (Ambystoma mexicanum). Our findings demonstrate Novabrowse's capacity for evidence-based evaluation of annotation artifacts, an essential capability as high-quality assemblies become more available for phylogenetically diverse species. Novabrowse is open source (MIT license) and freely available at: https://github.com/RegenImm-Lab/Novabrowse.
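The distance-based clustering of high-scoring segment pairs into putative gene units can be approximated by merging HSP intervals on the same subject sequence whenever their gap falls below a threshold. A sketch under that assumption; the 5 kb default is illustrative, not Novabrowse's actual setting:

```python
def cluster_hsps(hsps, max_gap=5000):
    """Single-linkage clustering of HSP intervals (start, end) on one subject
    sequence into putative gene units: consecutive HSPs whose gap is at most
    max_gap are merged into a single spanning interval."""
    clusters = []
    for start, end in sorted(hsps):
        if clusters and start - clusters[-1][1] <= max_gap:
            clusters[-1][1] = max(clusters[-1][1], end)  # extend current unit
        else:
            clusters.append([start, end])  # start a new unit
    return [tuple(c) for c in clusters]
```

Grouping HSPs this way is what lets scattered exon-level hits be read as a single candidate locus, while an isolated distant hit stays separate.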
Quadrini, M.; Tesei, L.
The ability to access, search, and analyse large collections of RNA molecules together with their secondary structure and evolutionary context is essential for comparative and phylogeny-driven studies. Although RNA secondary structure is known to be more conserved than primary sequence, no existing resource systematically associates individual RNA molecules with curated phylogenetic classifications. Here, we introduce PhyloRNA, a curated meta-database that provides large-scale access to RNA secondary structures collected from public resources or derived from experimentally resolved 3D structures. PhyloRNA allows users to search, select, and download extensive sets of RNA molecules in multiple textual formats, each entry being explicitly linked to phylogenetic annotations derived from five curated taxonomy systems. In addition to taxonomic information, each RNA molecule is accompanied by a rich set of descriptors, including pseudoknot order, genus, and three levels of structural abstraction--Core, Core Plus, and Shape--which facilitate comparative analyses across sets of molecules. PhyloRNA is publicly available at https://bdslab.unicam.it/phylorna/ and is regularly updated to incorporate newly available data and revised taxonomic annotations.
Lisiecka, A.; Kowalewska, A.; Dojer, N.
Pangenome graphs conveniently represent genetic variation within a population. Several types of such graphs have been proposed, with varying properties and potential applications. Among them, variation graphs (VGs) seem best suited to replace reference genomes in sequencing data processing, while whole genome alignments (WGAs) are particularly practical for comparative genomics applications. For both models, no widely accepted optimization criteria for a graph representing a given set of genomes have been proposed. In the current paper we introduce the concept of homology relation induced by a pangenome graph on the characters of represented genomic sequences and define such relations for both the VG and WGA models. Then, we use this concept to propose homology-based metrics for comparing different graphs representing the same genome collection, and to formulate the desired properties of transformations between VG and WGA models. Moreover, we propose several such transformations and examine their properties on pangenome graph data. Finally, we provide implementations of these transformations in the WGAtools package, available at https://github.com/anialisiecka/WGAtools.
Lan, W.; Wang, D.; Chen, W.; Yan, X.; Chen, Q.; Pan, S.; Pan, Y.
Motivation: tRNA-derived small RNAs (tsRNAs) have emerged as a novel class of regulatory molecules implicated in the pathogenesis of many human diseases, making them promising biomarkers and therapeutic targets. However, existing computational methods for tsRNA-disease association prediction often overlook explicit biological attributes and complex feature interactions, limiting their predictive performance. Results: We propose ERFMTDA, an enhanced rotative factorization machine framework for predicting potential tsRNA-disease associations. ERFMTDA explicitly models complex interactions among heterogeneous biological features while integrating latent structural representations derived from the global association matrix. In addition, a biologically informed negative sampling strategy based on motif-level sequence similarity is introduced to improve the reliability of negative samples. Extensive experiments demonstrate that ERFMTDA consistently outperforms eleven state-of-the-art methods. Case studies on diabetic retinopathy and hepatocellular carcinoma further confirm its ability to prioritize biologically meaningful tsRNA-disease associations. Availability and implementation: The source code and datasets of ERFMTDA are available at https://github.com/lanbiolab/ERFMTDA.
Huang, K.-l.
Quality control (QC) of high-throughput sequencing data is a critical first step in genomics analysis pipelines. FastQC has served as the de facto standard for sequencing QC for over a decade, but its Java runtime dependency introduces startup overhead, elevated memory consumption, and deployment complexity. Meanwhile, the growing adoption of long-read sequencing platforms from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has created a pressing demand for QC tools capable of handling both short and long reads. However, existing solutions require separate tools for each data type and an additional aggregation tool, such as MultiQC, to consolidate results across samples. Here we present RastQC, a unified sequencing QC tool written in Rust that combines FastQC-compatible short-read QC, long-read-specific metrics, built-in multi-sample summary, native MultiQC JSON export, and a web-based report viewer in a single 2.1 MB static binary. RastQC implements all 12 standard FastQC modules with matching algorithms, plus 3 long-read modules (Read Length N50, Quality Stratified Length, and Homopolymer Content), achieving 100% module-level concordance with FastQC across 55 out of 55 calls on five model organisms. RastQC's streaming parallel pipeline with adaptive batch sizing delivers 1.8-3.2x speedup on short-read Illumina data and 4.7-6.5x speedup on long-read ONT/PacBio data, while using 8-9x less memory on small files and comparable memory on large files. RastQC is freely available as an AI agent skill at https://github.com/Huang-lab/RastQC under the MIT license.
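Read Length N50, one of the long-read modules named above, is the standard assembly-style statistic applied to read lengths: the length L such that reads of length at least L contain at least half of all sequenced bases. A reference sketch (not RastQC's Rust implementation):

```python
def read_length_n50(lengths):
    """N50 of a collection of read lengths: walk reads from longest to
    shortest and return the length at which cumulative bases first reach
    half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

Unlike the mean, N50 is dominated by long reads, which is why it is the preferred length summary for ONT/PacBio data.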
Mahlich, Y.; Ross, D. H.; Monteiro, L.; McDermott, J. E.
Motivation: Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein, and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass-spectrometry. Results: The VaLPAS (Variation-Leveraged Phenomic Association Screen) framework is available as a Python package and provides a user-friendly platform for calculation of associations between expression patterns of genes or proteins in multi-omic datasets based on various statistical and learning methods. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of molecules using guilt by association with molecules of known function. We present results demonstrating the utility of VaLPAS to identify high-confidence predictions for a subset of genes/proteins of unknown function in a previously published multi-omics dataset from the oleaginous yeast, Rhodotorula toruloides. Availability: VaLPAS is written in Python. The code is hosted on GitHub (https://github.com/PNNL-Predictive-Phenomics/valpas/).
Appulingam, Y.; Jammal, J.; Ali, A.; Topp, S.; NYGC ALS Consortium; Iacoangeli, A.; Pain, O.
Background: Differential expression analysis is a central tool for studying the biological processes altered in human diseases via transcriptomic signatures. However, transcriptomic datasets are systematically confounded by latent variables from two distinct sources: unmeasured technical and biological heterogeneity within the expression data, and expression differences driven by population stratification. Correction using expression-based surrogate variables (SVs) and genotype-based principal components (PCs) addresses these sources independently, yet no study has directly evaluated their combined use against either method alone within a differential expression framework. In this study we hypothesised that simultaneously including both correction layers would produce more biologically valid and reproducible results than either approach alone, and tested this in two independent RNA-seq datasets of amyotrophic lateral sclerosis (ALS) cases and controls with matching genotype data. Results: Four nested differential expression models (corrected for PCs only, SVs only, both SVs and PCs, and neither) were evaluated across the KCLBB (96 cases and 52 controls) and ALS Consortium (272 cases and 35 controls) datasets. Models were evaluated on: cross-dataset effect size concordance, cross-dataset replicability quantified by the Jaccard similarity index, and biological recall against a curated reference set of 66 known ALS genes. The combined SV+PC framework consistently outperformed simpler models across all metrics. Replicability improved nearly ten-fold compared to the non-corrected model (Jaccard index: 2.28% to 19.5%), and the combined framework exhibited a statistically significant 2.1% gain over the SV-only model. Recall of known ALS genes doubled compared with SV correction alone. Crucially, effect size stability was preserved, with the combined model expanding the shared transcriptomic signal without sacrificing consistency. 
These findings remained generally robust to the number of PCs in sensitivity analyses. Conclusions: This study found that SVs and genotype PCs address non-redundant sources of confounding, and we recommend their combined use as standard practice in differential expression analysis where matched genotype data are available. Notably, PCs capturing population structure can also be derived directly from RNA-seq data, extending the applicability of this framework to studies lacking matched genotype data. Although this analysis was restricted to ALS datasets, we expect these findings to generalise to other traits.
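Not from the preprint: a minimal sketch of the Jaccard similarity index used above to quantify cross-dataset replicability, applied to two hypothetical sets of significant genes (the gene names are illustrative, not results from the study).

```python
def jaccard_index(set_a, set_b):
    """Jaccard similarity: |intersection| / |union| of two sets,
    e.g. significant gene lists from two independent datasets."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

degs_dataset1 = {"SOD1", "TARDBP", "FUS"}
degs_dataset2 = {"SOD1", "FUS", "C9orf72", "NEK1"}
# 2 shared genes out of 5 distinct genes overall.
print(jaccard_index(degs_dataset1, degs_dataset2))  # -> 0.4
```

A jump from 2.28% to 19.5% on this index means the significant gene lists from the two cohorts went from almost disjoint to sharing roughly a fifth of their union.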
Parmigiani, L.; Peterlongo, P.
A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
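Not from the preprint: a minimal sketch of the Hill numbers mentioned above, using the standard definition from ecology: the Hill number of order q is (Σ pᵢ^q)^(1/(1−q)), with the q→1 limit equal to the exponential of Shannon entropy. Here the abundances would be, for example, counts of graph nodes traversed by a given number of genomes.

```python
import math

def hill_number(abundances, q):
    """Hill number (effective diversity) of order q for a
    frequency distribution. q=0 counts categories equally
    (richness); larger q downweights rare categories."""
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if q == 1:
        # Limit as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# A uniform distribution over 4 categories has diversity 4 at every order.
print(hill_number([1, 1, 1, 1], 0))  # -> 4.0
print(hill_number([1, 1, 1, 1], 2))  # -> 4.0
# A skewed distribution loses effective diversity as q grows.
print(hill_number([97, 1, 1, 1], 0))  # -> 4.0 (richness ignores skew)
```

Varying q gives a diversity profile: low orders emphasize rare nodes (e.g. sequences private to one genome), high orders emphasize the common core, which is exactly the tunable weighting of rare versus common nodes the abstract describes.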